This article presents a probabilistic generative model for text based on semantic topics and 
syntactic classes called Part-of-Speech LDA (POSLDA). POSLDA simultaneously uncovers 
short-range syntactic patterns (syntax) and long-range semantic patterns (topics) that exist in 
document collections. This results in word distributions that are specific to both topics (sports, 
education, ...) and parts-of-speech (nouns, verbs, ...). For example, multinomial distributions 
over words are uncovered that can be understood as "nouns about weather" or "verbs about law". 
We describe the model and an approximate inference algorithm and then demonstrate the quality 
of the learned topics both qualitatively and quantitatively. Then, we discuss an NLP application 
where the output of POSLDA can lead to strong improvements in quality: unsupervised part- 
of-speech tagging. We describe algorithms for this task that make use of POSLDA-learned 
distributions that result in improved performance beyond the state of the art. 

1. Introduction 

Two highly related phenomena are resulting in a renaissance for the study of com- 
putational linguistics. The first is the increasing level of access to textual data sources 
over the Internet such as classic works of literature (The Gutenberg Project), structured 
explanatory knowledge of the world (Wikipedia), people's every thought (Twitter), 
governments' every desire (laws and legal decisions), and ongoing triumphs, defeats, 
changes, and breaking information (online news). The second, concomitant with the 
first, is the growth of interest in, and the power of, machine learning algorithms which 
can exploit the vast amounts of data that are being made available and help make sense 
of them. 

Like language itself, machine learning techniques can be described and contrasted 
with each other along a number of different axes of understanding and dichotomies. 
One of the most important of these dichotomies is the division between supervised and 
unsupervised learning approaches (Blei, Griffiths, and Jordan 2010). While supervised
approaches are concerned with generalizing a function that predicts an output y given
some input x learned from examining labeled example pairs $(x_i, y_i)$, unsupervised
approaches involve uncovering hidden patterns and associations that exist in data (Hastie,
Tibshirani, and Friedman 2001). Clustering algorithms, for example, are unsupervised
machine learning techniques that attempt to group data together because they are simi- 
lar in some way. A logical definition of similarity - especially with respect to linguistics 
- is, informally, that two texts are similar because they discuss the same topics. 
Another canonical example of unsupervised learning is dimensionality reduction.
In reducing the dimensionality of a dataset, an algorithm seeks to find a simpler or
more condensed representation of the data while preserving (or bringing forth) some
kind of meaning. An apt dimensionally-reduced representation of texts - collections of 
words - is the topics that they represent. In the standard bag-of-words representation 
used for many natural language processing tasks, the cardinality of the dimensional 
representation is the number of distinct words that can be used in the texts, that is, the 
vocabulary. This number is typically on the order of several thousand, while a text's 
content, in a broad sense, can also be described by the topics that it addresses out of a 
finite number of generic topics on the order of hundreds or less. 

It is therefore no surprise that probabilistic topic models, which are generative 
models of text (based on unsupervised machine learning algorithms), are continually 
growing in interest. They provide a means to uncover the hidden thematic structure 
that underlies large document collections and therefore allow us to explore, summarize,
and understand what a collection is about with ease and efficiency. Latent Dirichlet
Allocation (LDA) (Blei, Ng, and Jordan 2003), the original topic model, describes the
generative story that lays the foundation for this kind of model. It posits that a doc- 
ument can be created through a random process of drawing words from a mixed- 
membership model. At a high level, a document is created by first selecting the topics 
that it will address, and then randomly generating words from distributions associated 
with those topics. Numerous more complex models have been presented that build on
this basic idea and several introductory papers exist that cover the area in detail (Blei
2012).



More specifically, the LDA generative process works as follows. For each document
$d$, a document-specific topic portion $\theta_d$ is drawn from a Dirichlet distribution. $\theta_d$ is a
discrete distribution over $K$ topics and corresponds to the weight that each topic will
have in the document. Then, for each word $w_i$, a topic index $z_i$ is drawn from $\theta_d$. To
generate the word, a token is drawn from a topic-specific word distribution $\phi^{(z_i)}$. There
are $K$ topic-specific word distributions, each of which corresponds to a distribution over
words specific to the given topic. To generate a document, however, one would require
the word distributions specific to each topic, and for each document, one would also
need the topic portion. Neither of these is readily available from the input, but they
can be learned from a document collection by reversing the generative process through
posterior inference.
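
The generative story above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy vocabulary, and the hand-set topic-word distributions are hypothetical and are not part of the model description, and the Dirichlet draw is implemented by normalizing Gamma variates from the standard library.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_lda_document(phi, alpha, n_words, rng):
    """Draw a topic portion for one document, then a topic and a word per position."""
    theta_d = sample_dirichlet(alpha, rng)   # document-specific topic portion
    topics, words = [], []
    for _ in range(n_words):
        z = sample_index(theta_d, rng)       # topic index for this position
        w = sample_index(phi[z], rng)        # word from the topic-specific distribution
        topics.append(z)
        words.append(w)
    return theta_d, topics, words


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = ["court", "judge", "stock", "bond"]   # hypothetical toy vocabulary
    phi = [[0.5, 0.5, 0.0, 0.0],                  # toy "law" topic
           [0.0, 0.0, 0.5, 0.5]]                  # toy "finance" topic
    theta_d, topics, words = generate_lda_document(phi, [0.5, 0.5], 8, rng)
    print([vocab[w] for w in words])
```

Posterior inference runs this process in reverse: given only the words, it estimates the topic-word distributions and the per-document topic portions that the sketch takes as given.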

In unsupervised learning, we cannot simply write a general algorithm to find what 
we hope to be interesting patterns. The "no free lunch" theorem tells us essentially 
that no interesting patterns can be uncovered if we do not assume that certain kinds 
of patterns must exist (Wolpert and Macready 1997). It turns out that assuming that
documents reflect latent topics and that words from the same topics will co-occur 
is a good assumption, and thus interesting results can be learned by reversing
the assumed model, such as the words that are most important to each topic and a
dimensionally-reduced representation of each document in topic space. However, the
number and the type of assumptions both matter. If we make too many, the
data may not fit those assumptions and the output will be nonsense. Conversely, if we 
make too few, interesting patterns may be missed. 

The standard document representation for topic modeling is the bag-of-words
(Salton and McGill 1983). Each document is represented by the number of times that
each word in a fixed vocabulary appears. Because word order is ignored, a great deal 
of meaning is lost, but the representation is efficient and has proved to be successful. 
However, word order - ignored in canonical topic models - is clearly of importance to 
language because though some sense may be extracted from a text whose words have
been scrambled, the full meaning can only come about through a parsing of the words
based on their order and relation to each other. It has also been shown that different
parts of the brain are used to understand semantics and syntax (Boyd-Graber and Blei
2010). Traditional topic models miss this kind of syntactical information because they
are never exposed to word-order patterns in the first place. While the bag-of-words 
approach is efficient, further advances in NLP will require algorithms to be able to have 
the full understanding that humans are afforded. 
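
To make the loss of word order concrete, the short Python sketch below builds bag-of-words count vectors for two sentences that contain the same words in a different order. The helper name and the three-word vocabulary are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text, ignoring word order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["dog", "bites", "man"]
first = bag_of_words("dog bites man", vocabulary)
second = bag_of_words("man bites dog", vocabulary)
print(first == second)  # True: both sentences map to the same count vector
```

The two sentences mean different things, yet a model that only sees these count vectors cannot tell them apart.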

In this article we introduce a new probabilistic generative model called Part-of- 
Speech LDA (POSLDA) which considers both the semantic topic that a word is asso- 
ciated with (if any) and its syntactic purpose in a sentence. This approach allows a 
more structured view of language creation and as a result, the posterior distributions 
that are learned are more specific and meaningful than in other topic models. It also 
allows NLP applications to make better predictions and deductions because each piece 
of information - syntactic and semantic - provides evidence that can help disambiguate
the purpose and meaning of a word. 

The article is organized as follows. In section 2, we discuss previous work that has 
focused on bringing syntactic information into probabilistic topic models. In section 3,
we explain our model, POSLDA, in detail. We then present the results of three sets
of experiments in section 4: first, we demonstrate qualitatively the interpretability of
the uncovered posterior distributions with example syntax-specific topics on a number
of diverse datasets; second, we report quantitative results on the model's ability to 
generalize on unseen texts and to uncover high quality topics; and third, we show how 
the model's ability to disambiguate word use through the joint influences of semantics 
and syntax can lead to better results in unsupervised part-of-speech (POS) tagging 
than a Bayesian Hidden Markov Model (HMM). Finally, in section 5, we conclude with 
thoughts on future work. 

2. Topics and Syntax 

The constraints that are imposed by language on phrase structure and word order are 
called syntax (Manning and Schütze 1999). The syntactic meaning of a word helps to
explain its functional purpose in a sentence, whereas the semantic meaning is related 
to its lexical-thematic purpose. The former is based on short-range dependencies at the 
sentence level, while the latter realizes long-range dependencies at the document level. 
LDA and other topic models uncover patterns by exploiting the long-range dependen- 
cies of words co-occurring. Here, we want to add the short-range dependency structures 
to the model and we do so by focusing on modeling the functional purpose of a word in 
a sentence. We are therefore interested in the part-of-speech category that a word - in 
its given context - belongs to. These include nouns, verbs, adjectives, adverbs, prepo- 
sitions, conjunctions, etc. The canonical tool for unsupervised word syntax modeling is 
the hidden Markov model (HMM) (Rabiner 1990) and it is therefore a natural place to
begin in adding syntax information to a semantic topic model.

The first work in combining syntactic notions of language with probabilistic topic
models is based on embedding an LDA-like model in a single state of an HMM (Griffiths
et al. 2005). This model - dubbed HMMLDA - represents an asymmetric composite
model where all generated words follow short-range syntactic dependencies, but only
"semantic" words that are generated from a single state obey long-range dependencies. 
Words that carry long-range dependencies will be generated given the document- 
specific topic distribution, and other words will be generated independent of the current 
theme. This framework is different from the traditional use of the HMM in syntax
modeling as each state will not correspond to a discrete part-of-speech. The content 
class in particular - which is designated as the sole class that can generate "semantic" 
words - will need to subsume nouns, verbs, adjectives, and other words that are 
topic-dependent. We will return to this simplifying assumption when we present our 
POSLDA model. 

More formally, the HMMLDA model is defined by two sets of sequential latent
variables: $(z_n)_{n=1}^{N}$, which represent the latent topics for each word, and $(c_n)_{n=1}^{N}$, which
represent the latent syntactic classes for these words. One state in the model, $s_0 \in S$, is
designated as the semantic class where the LDA-like topic model is embedded. Each
topic $k$ is associated with a discrete topic-word distribution $\phi^{(k)}$ and each class $s \neq s_0$
is associated with a syntax word distribution $\phi^{(s)}$. Like LDA, each document $d$ can be
described by a distribution over topics $\theta^{(d)}$. However, unlike LDA, each word $w_i$ only
depends on its topic $z_i$ if its class $c_i = s_0$. A word's class is modeled with the embedded
HMM and transitions between classes $c_i$ and $c_{i+1}$ are encoded in a transition matrix $\pi$.
The generative process by which a document is created under the HMMLDA model is
as follows.



1. Draw $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)$
2. For each word $w_i$ in document $d$:
   (a) Draw topic $z_i \sim \theta^{(d)}$
   (b) Draw class $c_i \sim \pi^{(c_{i-1})}$
   (c) If $c_i = s_0$:
       i. Draw $w_i \sim \phi^{(z_i)}$
   (d) Else:
       i. Draw $w_i \sim \phi^{(c_i)}$
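
The word-level steps of this process can be sketched in Python as follows. The sketch is illustrative only and rests on several assumptions that are not in the text: the function and argument names are hypothetical, the document's topic distribution is passed in already drawn, the semantic state is given index 0, and a fixed start class stands in for the unspecified initial state of the HMM.

```python
import random


def draw(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_hmmlda_words(theta_d, pi, phi_topic, phi_class, n_words, rng,
                          semantic_class=0, start_class=0):
    """Generate (word, class, topic) triples for one document under the HMMLDA story."""
    prev_class = start_class                 # assumed initial state
    tokens = []
    for _ in range(n_words):
        z = draw(theta_d, rng)               # (a) a topic is drawn for every position
        c = draw(pi[prev_class], rng)        # (b) class from the previous class's row
        if c == semantic_class:
            w = draw(phi_topic[z], rng)      # (c) semantic state: word depends on topic
        else:
            w = draw(phi_class[c], rng)      # (d) other states: word depends on class only
        tokens.append((w, c, z))
        prev_class = c
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = ["the", "of", "court", "stock"]          # hypothetical toy vocabulary
    theta_d = [0.9, 0.1]
    pi = [[0.2, 0.8], [0.7, 0.3]]                    # class 0 semantic, class 1 function words
    phi_topic = [[0, 0, 1, 0], [0, 0, 0, 1]]         # topic 0 -> "court", topic 1 -> "stock"
    phi_class = [None, [0.5, 0.5, 0, 0]]             # class 1 -> "the" / "of"
    for w, c, z in generate_hmmlda_words(theta_d, pi, phi_topic, phi_class, 6, rng):
        print(vocab[w], c, z)
```

Note that a topic is drawn at every position even when the class is not the semantic one; in that case the topic simply has no influence on the emitted word.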



Like LDA, exact posterior inference is intractable for the HMMLDA model (Griffiths
et al. 2005). Griffiths et al. therefore turn to Gibbs sampling. The HMM in the HMMLDA
is a Bayesian HMM, meaning that the transition rows $\pi^{(c)}$ and the emission probabilities
$\phi^{(c)}$ are multinomial parameters with Dirichlet priors. The same framework for
collapsed Gibbs sampling can therefore be used as with the original LDA. 

One of the most interesting qualities of the HMMLDA model is that stop-words and
other "syntax-only" words are pulled to the non-semantic classes so that the learned 
topics are interpretable and noise-free without any need for pre-processing or a priori 
stop-word removal. This defines one of the key motivations in the development of 
POSLDA. While the model seems to learn distributions that are noticeably useful, 
it is still far from perfect. It almost exclusively finds nouns as content words when
verbs are often equally important in semantic topics. This could be due to the use
of the HMM that learns to discern nouns from other parts-of-speech and simply sees 
other semantically-important types of words as outliers that are pushed to the syntactic 
classes. 

Another recent approach to combining syntax and semantics into a coherent probabilistic
generative model is the Syntactic Topic Model (STM) (Boyd-Graber and Blei
2010). Unlike the HMMLDA model, where a word is deemed to either come from a
corpus-wide syntax class or a semantic-based topic, the STM discovers topics that are
both syntactically and semantically coherent. This is more directly in line with our goals
for the POSLDA model. As for the motivation in combining these notions of word
information, Boyd-Graber and Blei provide an edifying example:

    Next weekend, you could be relaxing in ____.

There are two distinct text-modeling based approaches one could use to reason about
what word might be used to fill in the blank. With a topic model such as LDA, where
this document is about travel, high probability words might include "sailing", "Rome",
or "flight". With a syntax model, it might determine that the missing word should be
a noun, and thus high probability words could include "bed", "church", or "school".
However, as is explained in (Boyd-Graber and Blei 2010), the best candidate to fill in
the blank is an intersection of these two types of reasoning. The word "sailing" matches
the topic related aspect of the sentence (travel), but does not fit syntactically (verb). The
opposite is the case for the noun "school". However, when both syntactic and semantic
notions of language are taken into account, a word such as "Rome", which has high
probability both in the travel topic and in the noun syntax class, will be selected with
high probability.

Table 1
Example Topics from the STM

hates     bucks   runs    professor  stock
dreads    surges  falls   phd        share
mourns    climbs  walks   candidate  mutual
fears     falls   sits    grad       fund
despairs  runs    climbs  student    on

The STM is a non-parametric model where the number of topics is not set a priori but 
is determined, through posterior inference, by the data. While the standard LDA model 
draws the document topic portions for a document from a $K$-dimensional Dirichlet
distribution with a fixed value of $K$ topics, here the transition distributions $\pi$ and
document topic portions $\theta_d$ are drawn from a Dirichlet Process (DP), where a vector $\beta$ of
infinite length is a global weight that is drawn from a stick-breaking distribution and is
used as a base measure for the DP. This approach frees users of the model from having
to determine themselves how many topics a corpus might contain. Our POSLDA model
can also easily become a nonparametric Bayesian model by incorporating a hierarchical
Dirichlet Process (HDP) prior (Teh et al. 2006).¹ In this formulation, the number of topics
can be learned from the data in the sense that through inference we can learn the optimal 
number of topics K that will maximize the likelihood of the data. 

The generative process for the STM is as follows: 

1. Draw global weights $\beta \sim \mathrm{GEM}(\alpha)$
2. For each topic index $k = \{1, \ldots\}$:
   (a) Draw topic $\tau_k \sim \mathrm{Dir}(\sigma\rho_u)$
   (b) Draw transition distribution $\pi_k \sim \mathrm{DP}(\alpha_T, \beta)$
3. For each document $d = \{1, \ldots, M\}$:
   (a) Draw document weights $\theta_d \sim \mathrm{DP}(\alpha_D, \beta)$
   (b) For each sentence root node with index $(d, r) \in \text{SENTENCE-ROOTS}_d$:
       i. Draw topic assignment $z_{d,r} \propto \theta_d \pi_{\text{start}}$
       ii. Draw root word $w_{d,r} \sim \tau_{z_{d,r}}$
   (c) For each additional child with index $(d, c)$ and parent with index $(d, p)$:
       i. Draw topic assignment $z_{d,c} \propto \theta_d \pi_{z_{d,p}}$
       ii. Draw word $w_{d,c} \sim \tau_{z_{d,c}}$

1 The approach to do so specifically for POSLDA is outlined in §3.4 of

While this generative process is clearly much more complex than the corresponding 
one for simpler models such as LDA and HMMLDA, it is also more powerful. The key 
steps are 3(b)(i) and 3(c)(i) where the topic assignment for a word is chosen to be a
convolution of the long-range semantic topic portion $\theta_d$ and the short-range syntactic
probability $\pi_{z_{d,p}}$. Some example semantically and syntactically coherent topics learned
from the STM are shown in Table 1. Note that in each case, the high probability words
are both syntactically equivalent (noun, verb, etc.) and semantically related in a topic
modeling sense. 
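
The combination performed in steps 3(b)(i) and 3(c)(i) can be illustrated with a small Python sketch. It assumes that the topic assignment is proportional to the elementwise product of the document's topic weights and the transition row of the parent's topic, and it truncates the model to a finite number of topics for simplicity. The function name and all numbers are hypothetical.

```python
def child_topic_distribution(theta_d, pi_parent):
    """Normalize the elementwise product of document weights and a transition row."""
    weights = [t * p for t, p in zip(theta_d, pi_parent)]
    total = sum(weights)
    if total == 0:
        raise ValueError("no topic has support under both distributions")
    return [w / total for w in weights]


theta_d = [0.7, 0.2, 0.1]     # hypothetical document weights: mostly topic 0
pi_parent = [0.1, 0.1, 0.8]   # hypothetical transition row: parent's topic favors topic 2
print(child_topic_distribution(theta_d, pi_parent))
```

In this toy case neither source of evidence decides alone: topic 2 ends up most probable because the syntactic preference for it outweighs its low document weight, while topic 1 is unlikely under both.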

While the posterior distributions learned by the STM appear to fall in line with 
our goals of a syntactically-cognizant generative topic model, it suffers from certain 
limitations that we would like to address. First, the generative story depends on a form 
of meta-sentence structure that exists before the words have been generated. That is, 
texts are generated based on the probabilities in sentence-specific dependency parse 
trees (Boyd-Graber and Blei 2010). Therefore, to perform inference with the STM - and
recover the posterior distributions like those in Table 1 - the data must be separately
pre-processed into sentence dependency parses. This means that the model is not
learning the syntax patterns itself, but is using information that must be supplied by
a separate algorithm before inference is performed. Conversely, we are interested in a 
fully generative model that is consistent across syntax and semantics where inferring 
the short-range sentence-wide dependencies forms part of the model. 

In the next section we will present our model, POSLDA. It is a strong generalization 
of HMMLDA in that it follows the idea that we can combine an HMM with an LDA- 
like topic model, but it takes the idea further so that topic-dependent words can also be 
influenced by different syntax classes and thus learn part-of-speech specific posterior 
distributions. 



3. POSLDA 



While LDA is in some sense a simple extension of probabilistic latent semantic analysis
(Hofmann 2001), it can be seen as the first fully generative topic model by virtue of
the Dirichlet prior that is placed on the document-topic portions, which in effect frees
the model from specific training data. Since its inception, LDA has been extended in
numerous ways and particularly by infusing the model with additional factors. Word
distributions can become more specific if we consider that generated words are dependent
on not only the current topic, but also other latent aspects such as the sentiment
of the writing and the writer's personality or ideological perspective (Paul and Girju
2010; Ahmed and Xing 2010). This allows one to uncover such word distributions as
"positive/negative words about films" or "words about weather from the perspective of
Americans/Swedes/Australians". In fact, this approach is so powerful that it has been
generalized into techniques that can easily add specific factors to topic models through
the use of strong prior information (Paul and Dredze 2012). However, to include word
syntax, we need a different approach because this factor does not come out of the types
of words that are used, but their order.

Part-of-Speech LDA (POSLDA) is an extension and generalization of LDA and
HMMLDA that is designed to understand the long- and short-range dependencies
between words, and as a tool for more complex NLP tasks that require both semantic
and syntactic information to attain optimal performance. In an HMM, words are
considered independent of their wider context within a document, but depend on the
classes of the words that appear before them. Therefore, the word order in a syntax
model is important and the bag-of-words representation used in canonical topic models
is no longer appropriate. Because both types of word information are important, and
modeling each separately entails certain restrictions, we seek to bridge these restrictions
with a unified model of language, POSLDA.

Figure 1
Graphical model depiction of POSLDA (left) and Labeled POSLDA (right) for unsupervised POS
tagging.

Under POSLDA, each word token is now associated with two latent variables: a
topic $z$ and a syntactic class $c$. We posit that the topics are generated through the LDA
process, while the classes are generated through a Bayesian HMM. The observed word
tokens are then generally dependent on both the topic and the class: rather than a single
discrete distribution for a particular topic $z$ or a particular class $c$, there are distributions
for each topic-class pair $(z, c)$ from which we assume words are sampled. However,
there also exists a set of "syntactic-only" words that do not depend on the thematic
context of a document (Griffiths et al. 2005). These words - such as determiners, prepositions,
and conjunctions - are often called "function" words and should be modeled as
"universal" syntax classes that are not affected by - and are not assigned - a latent topic. 

We therefore are interested in a generative model of text where all generated words 
depend on the function that they perform in a sentence, and a subset of these words 
also depend on the current semantic topic. For the HMM-like portion of the model, 
we denote the set of classes $C = C_{SEM} \cup C_{SYN}$, which includes the set of semantic classes
$C_{SEM}$ and the set of syntactic (function word) classes $C_{SYN}$. If a word is generated from
a function word class, it does not depend on the topic. This allows our model to 
accommodate functional words that appear independently of the topical content of a 
document. 

We use a similar notation to LDA, where $\theta_d$ is a document-topic portion and $\phi$ is
a word distribution. Additionally, we denote the HMM transition matrix $\pi$, which we
assume has rows that are drawn from a Dirichlet distribution with hyperparameter $\gamma$.
Denote $S = |C|$ and $T = |Z|$, the numbers of classes and topics, respectively. There are
$S_{SYN}$ word distributions $\phi^{SYN}$ for function word classes and $T \times S_{SEM}$ word distributions
$\phi^{SEM}$ for semantic classes. A graphical model depiction of POSLDA is shown in
Figure 1. This figure also denotes a slight variation to the model called Labeled POSLDA
which is analogous to Labeled LDA for the original LDA (Ramage et al. 2009). Here, an 
observed dictionary denoted by A restricts the classes that certain words can take on. We 
will make use of this model for dictionary-based unsupervised part-of-speech tagging. 

The generative process for a corpus of documents $\mathcal{D}$ under the POSLDA model is
described as follows:

1. For each row $\pi_r \in \pi$:
   (a) Draw $\pi_r \sim \mathrm{Dirichlet}(\gamma)$
2. For each word distribution $\phi_j \in \phi$:
   (a) Draw $\phi_j \sim \mathrm{Dirichlet}(\beta)$
3. For each document $d \in \mathcal{D}$:
   (a) Draw $\theta_d \sim \mathrm{Dirichlet}(\alpha)$
   (b) For each word token $w_i \in d$:
       i. Draw $c_i \sim \pi_{c_{i-1}, \ldots, c_{i-n}}$
       ii. If $c_i \notin C_{SEM}$:
           A. Draw $w_i \sim \phi^{SYN}_{c_i}$
       iii. Else:
           A. Draw $z_i \sim \theta_d$
           B. Draw $w_i \sim \phi^{SEM}_{c_i, z_i}$
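
As a rough illustration of step 3(b), the following Python sketch generates the tokens of a single document once the transition rows, word distributions, and document-topic portion have been drawn. It makes simplifying assumptions that go beyond the text: transitions condition on only the single previous class (a first-order HMM), a fixed start class replaces the unspecified initial history, and all names, the toy vocabulary, and the probabilities are hypothetical.

```python
import random


def draw(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_poslda_tokens(theta_d, pi, phi_syn, phi_sem, semantic_classes,
                           n_words, rng, start_class=0):
    """Generate (word, class, topic) triples; topic is None for function word classes."""
    prev_class = start_class                   # assumed initial state
    tokens = []
    for _ in range(n_words):
        c = draw(pi[prev_class], rng)          # i. class from the HMM transition row
        if c not in semantic_classes:
            z = None                           # ii. function word: no topic is assigned
            w = draw(phi_syn[c], rng)
        else:
            z = draw(theta_d, rng)             # iii.A. topic from the document portion
            w = draw(phi_sem[(c, z)], rng)     # iii.B. word from the class-topic pair
        tokens.append((w, c, z))
        prev_class = c
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = ["the", "court", "stock", "ruled", "rose"]
    # Classes: 0 = determiner (syntactic), 1 = noun and 2 = verb (both semantic).
    pi = [[0.0, 1.0, 0.0], [0.1, 0.0, 0.9], [1.0, 0.0, 0.0]]
    phi_syn = {0: [1, 0, 0, 0, 0]}
    phi_sem = {(1, 0): [0, 1, 0, 0, 0], (1, 1): [0, 0, 1, 0, 0],   # nouns about law / finance
               (2, 0): [0, 0, 0, 1, 0], (2, 1): [0, 0, 0, 0, 1]}   # verbs about law / finance
    theta_d = [0.8, 0.2]
    tokens = generate_poslda_tokens(theta_d, pi, phi_syn, phi_sem, {1, 2}, 9, rng,
                                    start_class=2)
    print(" ".join(vocab[w] for w, _, _ in tokens))
```

Keying the semantic word distributions by the pair (class, topic) is what yields distributions such as "nouns about law" or "verbs about finance", while the function word distributions are keyed by class alone.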

In traditional topic models, it is generally the case that common function words will 
overwhelm the word distributions, leading to suboptimal results and learned word 
distributions that are difficult to interpret. This problem is often skirted by either data 
pre-processing (e.g. removing stop words from a domain-dependent list) (Blei, Ng, and
Jordan 2003), backing off to "background" word models (Chemudugunta, Smyth, and
Steyvers 2006; Paul and Girju 2010), or by performing term re-weighting (Wilson and
Chew 2010). In the case of POSLDA, these common words are naturally explained by
the corresponding function word classes and are pushed to these distributions rather
than the topic-specific distributions during learning. 

3.1 Relations to Other Models 

The idea of having discrete word distributions for the cross product of topics and classes 
is related to multi-faceted topic models where word tokens are associated with multiple 
latent variables (Paul and Girju 2010; Ahmed and Xing 2010; Paul and Dredze 2012).
Under such models, words can be explained by a latent topic as well as a second (or
nth) underlying variable such as the perspective or dialect of the author, and words
may depend on both (or multiple) factors. In our case, the second variable is the part- 
of-speech - or functional purpose - of the token. 

POSLDA is also similar to a recent model called Nested HMM-LDA (nHMMLDA)
(Jiang 2009). The model described is very similar to POSLDA but contains certain
limitations. Principally, rather than allowing each word to be generated from any of K
topics, all words from a sentence must share the same topic. This is a strong assumption 
since it will not allow a sentence to discuss more than a single topic. 

POSLDA is constructed in a generalized manner and contains many existing mod- 
els as special cases. For example, POSLDA reduces to a Bayesian HMM when the 
number of topics K = 1, the original LDA model when the number of classes S = 1,
or the HMMLDA model when the number of semantic classes $S_{SEM} = 1$. One of the key
benefits of these reductions is that one can easily experiment with any of these models 
using a single POSLDA implementation by simply altering the necessary parameters. 
POSLDA can also easily reduce to the nHMMLDA model by forcing all words in a 
sentence to share the same topic. 
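
The reductions listed above amount to a small lookup on the model's size parameters. The Python sketch below is a hypothetical helper, not part of any POSLDA implementation; it only encodes the three conditions stated in the text and checks them in that order, so a degenerate setting that satisfies several conditions at once is reported under the first one that matches.

```python
def special_case(num_topics, num_classes, num_semantic_classes):
    """Name the model that POSLDA reduces to for the given size parameters."""
    if num_topics == 1:
        return "Bayesian HMM"        # a single topic leaves only the class structure
    if num_classes == 1:
        return "LDA"                 # a single class leaves only the topic structure
    if num_semantic_classes == 1:
        return "HMMLDA"              # one semantic class embedded among syntactic ones
    return "POSLDA"


print(special_case(num_topics=1, num_classes=17, num_semantic_classes=7))   # Bayesian HMM
print(special_case(num_topics=30, num_classes=1, num_semantic_classes=1))   # LDA
print(special_case(num_topics=30, num_classes=17, num_semantic_classes=1))  # HMMLDA
print(special_case(num_topics=30, num_classes=17, num_semantic_classes=7))  # POSLDA
```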

3.2 Approximate Inference 

The principal computational problem in probabilistic topic/syntax models is posterior
inference (Blei and Lafferty 2009). As it is based on LDA and the Bayesian HMM,
exact inference in the POSLDA model is also intractable. Therefore, following many
others, we make use of the MCMC-based approximate inference technique collapsed
Gibbs sampling (Griffiths and Steyvers 2004; Heinrich 2004). Here, the multinomial
parameters are first integrated out and we directly sample the indexing latent variables
$c_i$ and $z_i$. In POSLDA, if the class is designated as syntactic, then the word only depends on
the class. We therefore introduce the counts $n_w^{(c_i, z_i)}$, which correspond to the number of
times that word $w_i$ is assigned to class $c_i$ and topic $z_i$. Our sampling equation is then as
follows:




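$$
p(c_i, z_i \mid \mathbf{c}_{-i}, \mathbf{z}_{-i}, \mathbf{w}) \propto \rho_i \times
\begin{cases}
\dfrac{n_{w_i}^{(c_i)} + \beta_{w_i}}{n_{\cdot}^{(c_i)} + \beta_{\cdot}} & \text{if } c_i \in C_{SYN} \\[2ex]
\dfrac{n_{z_i}^{(d)} + \alpha_{z_i}}{n_{\cdot}^{(d)} + \alpha_{\cdot}} \cdot \dfrac{n_{w_i}^{(c_i, z_i)} + \beta_{w_i}}{n_{\cdot}^{(c_i, z_i)} + \beta_{\cdot}} & \text{if } c_i \in C_{SEM}
\end{cases}
\qquad (1)
$$

where

$$
\rho_i = \frac{n_{(c_{i-2}, c_{i-1}, c_i)} + \gamma_{c_i}}{n_{(c_{i-2}, c_{i-1})} + \gamma_{\cdot}} \cdot
\frac{n_{(c_{i-1}, c_i, c_{i+1})} + \gamma_{c_{i+1}}}{n_{(c_{i-1}, c_i)} + \gamma_{\cdot}} \cdot
\frac{n_{(c_i, c_{i+1}, c_{i+2})} + \gamma_{c_{i+2}}}{n_{(c_i, c_{i+1})} + \gamma_{\cdot}}
\qquad (2)
$$

Here a dot in place of an index denotes a sum over that index, $n_{z_i}^{(d)}$ is the number of
tokens in document $d$ assigned to topic $z_i$, the counts $n_{(\ldots)}$ are the numbers of times the
given class sequences occur, and all counts exclude the current position $i$.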



Note that we sample the pair $(c_i, z_i)$ jointly as a block, which requires computing a
sampling distribution over $S_{SYN} + T \times S_{SEM}$. It would also be valid to sample $c_i$ and $z_i$
separately, which would require only $S + T$ computations, in which case, the sampling
procedure would be somewhat different. Despite the lower number of computations per 
iteration, however, the sampler is likely to converge faster with our blocked approach 
because the two variables are tightly coupled. The intuition is that a non-block-based 
sampler could have difficulty escaping local optima because we are interested in the 
most probable pair; a highly probable class c sampled on its own, for example, could 
prevent the sampler from choosing a more likely pair (c', z). 
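
The size of the blocked sampling space, and the way a block draw picks a class and topic together, can be sketched as follows. This is an illustration under assumptions: `weight_fn` is a hypothetical stand-in for the unnormalized sampling probability of a candidate assignment, and the class indices and counts are made up for the example.

```python
import random


def blocked_candidates(syn_classes, sem_classes, num_topics):
    """Enumerate every joint assignment: one per syntactic class, T per semantic class."""
    pairs = [(c, None) for c in syn_classes]                       # no topic for function words
    pairs += [(c, z) for c in sem_classes for z in range(num_topics)]
    return pairs


def sample_pair(pairs, weight_fn, rng):
    """Draw one (class, topic) pair with probability proportional to weight_fn(c, z)."""
    weights = [weight_fn(c, z) for c, z in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    syn_classes, sem_classes, num_topics = [0, 1], [2, 3, 4], 4
    pairs = blocked_candidates(syn_classes, sem_classes, num_topics)
    print(len(pairs))                                              # 2 + 4 * 3 = 14 candidates
    print(len(syn_classes) + len(sem_classes) + num_topics)        # S + T = 9 if sampled separately
    # Hypothetical weights in which one specific pair is far more likely than any other.
    print(sample_pair(pairs, lambda c, z: 100.0 if (c, z) == (3, 2) else 1.0, rng))
```

Because every pair is scored directly, a pair that is jointly likely can be chosen in one step even when its class would look unattractive if scored on its own.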

One problem with MCMC-based methods is that assessing convergence can be 
difficult. We do not address specific approaches to get around this issue, but we note that 
a stabilizing likelihood can be used to infer convergence. Generally this happens after 
several hundred iterations (depending on the size of the dataset), and for the datasets 
used in this article, the likelihood will converge at about 2,000 iterations. Finally, we are 
interested in deriving point estimates for the topics $\phi^{(c,z)}$ from the sampled statistics. $\phi$
is a 3-dimensional array where $\phi_{w_i}^{(c,z)} = p(w_i \mid c, z)$. Therefore, following we get






Computational Linguistics 



Volume XX, Number xx 



[Figure 2 plot: log-likelihood (vertical axis) against Gibbs iteration (horizontal axis, 1000 to 5000).]

Figure 2
Gibbs iterations vs. log-likelihood while learning a POSLDA model from the TREC AP dataset
with $K = 50$, $S = 10$, $S_{SEM} = 5$.



4. Experiments and Results 

In this section we present a set of experiments on the POSLDA model to demonstrate its 
capabilities as a topic and syntax model of language. We demonstrate both qualitatively 
and quantitatively the model's ability to capture the semantic and syntactic axes of 
information prevalent in a corpus. We begin qualitatively with topic interpretability 
and then present quantitative results on the ability of POSLDA as a predictive language 
model. Following this, we show its ability as a model for performing unsupervised POS 
tagging. 



4.1 Topic Interpretability 

Judging the interpretability of a set of topics is highly subjective. Chang et al. look
at "word intrusion" where a user determines an intruding word from a set of words
that does not thematically fit with the other words, and "topic intrusion" where a user
determines whether the learned document-topic portion $\theta_d$ appropriately describes the
semantic theme of the document (Chang et al. 2009). Here, we are mostly interested
in subjectively demonstrating the low incidence of "word intrusion" both in terms of
semantics (theme) and syntax (part-of-speech). We subjectively demonstrate that our
model learns semantic and syntactic word distributions that are likely robust towards
problems of word intrusion.²

Table 2
Example topics learned from the TREC AP dataset with POSLDA.

"law"                            "finance"                        "health"
adj         verb       noun      adj        verb      noun        adj       verb       noun
federal     filed      attorney  stock      rose      exchange    health    died       study
court       ruled      judge     wall       averaged  stock       medical   suffered   research
supreme     agreed     district  bond       issued    securities  aids      received   hospital
legal       contends   calif     million    fell      dow         drug      underwent  virus
civil       claims     county    american   gained    york        blood     found      report
appeals     contended  board     financial  dropped   inc         heart     carried    disease
tax         refused    loan      composite  rated     totaled     research  suffers    university
illegal     sued       san       common     traded    drexel      immune    leaves     doctor
government  won        court     business   stocks    commission  hospital  kills      person
financial   wrote      justice   dow        closed    lambert     cancer    took       patient

To demonstrate the effectiveness of POSLDA's semantic-syntactic pattern recognition
ability, we fit a number of models to different datasets. We demonstrate results
both on more "traditional" news corpora such as TREC AP and the WSJ, and on more
esoteric datasets such as collections of tweets from the microblogging website Twitter
and collections of legal decisions from the Supreme Court of Canada.³ We begin with
a standard demonstration of topic interpretability on news data from the Associated
Press.

4.1.1 Traditional News Data. Table 2 shows three topics - manually labeled as "law",
"finance", and "health" - learned from a 2,250-document subset of the TREC AP corpus
(Harman 1992). We set the number of topics K = 30, the number of classes S = 17, and
the number of semantic classes S_sem = 7.⁴ We show the top ten words from three
POS-specific topics labeled manually as adjective, verb, and noun. The interpretability of the
topics and the cohesiveness of the terms with high probabilities is clear. All three topics
assign high probabilities to words that one would expect to have high importance. More
importantly, however, the POS-specific topics also clearly reflect their syntactic roles.
Each of the verbs is assuredly (even without the proper context) a verb, and the same
holds for the nouns. The adjectives seem to fit as well; though many of the words could
be considered nouns depending on the context, it is clear how, given the topic, each of
the words can very well act as an adjective. A final point worth mentioning is that,
unlike LDA, we do not perform stop-word removal. Instead, the POSLDA model has
pushed stop-words to their own syntactic classes (rather than semantic), freeing us from
having to perform pre- or post-processing steps to ensure interpretable topics. The top
words in four of these topic-independent syntactic classes are shown in Table 3 with
manually labeled class names.
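As a rough illustration of how word lists like those in Table 2 can be read off a fitted model, the sketch below extracts the most probable words for one (topic, class) pair. The nested list `phi`, its topic-by-class-by-vocabulary layout, the function name, and the toy vocabulary and probabilities are all assumptions made for illustration; they are not taken from the POSLDA implementation or from the reported results.

```python
def top_words(phi, vocab, topic, pos_class, n=10):
    """Return the n most probable words for one (topic, class) pair.

    phi[topic][pos_class] is assumed to be a list of word probabilities
    aligned with vocab (a hypothetical layout chosen for this sketch).
    """
    row = phi[topic][pos_class]
    # Sort word ids by descending probability and keep the first n.
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [vocab[i] for i in order[:n]]


# Toy example with made-up probabilities (not values from the paper).
vocab = ["federal", "legal", "filed", "ruled", "court"]
phi = [
    [
        [0.60, 0.30, 0.05, 0.03, 0.02],  # topic 0, class 0 (say, "adj")
        [0.02, 0.03, 0.55, 0.35, 0.05],  # topic 0, class 1 (say, "verb")
    ]
]
print(top_words(phi, vocab, topic=0, pos_class=0, n=2))
print(top_words(phi, vocab, topic=0, pos_class=1, n=2))
```

Labels such as "adj" or "law" are still assigned by hand after inspecting the lists, as described above.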



2 This is in line with our approach of viewing topic models as tools for performing other tasks. We are most
interested in objective quantitative results learned from applying our models to NLP tasks such as POS
tagging which we demonstrate later in this section.

3 http://scc.lexum.org/en/index.html

4 We choose this number based on intuition. We imagine that of the 17 often-delineated parts-of-speech
(Goldwater and Griffiths 2007), 6 or 7 will generally be theme-specific. These include adjectives, verbs,
gerund or present participle verbs, adverbs, nouns, and past participle verbs. It can be helpful to include
an "extra" semantic class to see if the model can find patterns of semantics-syntax that we might miss.

AUXILIARY     CONJUNCTION   DETERMINER    RELATIVE
is            and           the           that
was           but           a             which
be            or            an            who
are           &             this          when
has           so            some          what
have          both          such          how
will          times         any           where
would         nor           many          whose
says          plus          those         why
were          yet           these         whom

Table 3
Example topic-independent syntax class distributions (C_syn) learned from the AP dataset with
POSLDA.

"energy"                                     "corporations"
adj            verb           noun           adj            verb           noun
chemical       clean          plant          executive      named          president
power          waste          plants         vice           holding        company
environmental  attack         agency         operating      banking        officer
water          adore          utility        financial      managing       chairman
electric       exist          facility       management     succeed        board
energy         create         commission     company        succeeds       directory
air            combust        co.            board          named          post
pont           protect        industry       former         reacquired     unit
safety         mine           environment    investment     created        executive
gas            insult         corp.          division       centers        executives

Table 4
Example topics learned from the ACL_DCI WSJ dataset with POSLDA.



Next we look at the ACL_DCI release of the WSJ treebank dataset which contains
approximately 3 million words over 6,058 documents. We turn to this dataset both to
show further results of POSLDA's pattern recognition ability along the axes of both
semantics and syntax, and to demonstrate the scalability of the model to a larger corpus.
Because this is a much larger dataset than the TREC AP corpus, we set the number
of topics K = 50, but leave the class parameters untouched. Note that this approach
to "guessing" the number of topics represented by a dataset is a typical way to begin
understanding the make-up of a collection of documents. With the parametric version
of POSLDA, we can use perplexity as calculated on a held-out test set to help determine
the best number of topics, and this is explored in the following subsection. For a more
principled approach, we can use an HDP prior over the number of topics (Teh et al.
2006).



Table 4 shows two topics learned on the WSJ dataset with the parameters described
above. Again, we show each topic as three POS-specific topics learned from POSLDA:
verb, noun, and adjective. We show some of the more interpretable part-of-speech-specific
topics, but in general the learned distributions appear noisier than those found
on the smaller AP dataset. Nevertheless, the found topics are subjectively interpretable.

4.1.2 Domain Specific Data. While corpora of newswire documents are prevalent in
NLP research due to their contained articles' proclivity, there are a number of other
domains with large collections of data that will benefit from new statistical models for
corpus exploration, data mining, and other text-related tasks. These areas may include,
inter alia, public health, economics, and law. Studying domain-specific corpora can be
illuminating in text modeling research because often domains are filled with eccentricities
that do not show up in general writing. These include domain-specific vocabularies
and specialized writing structures. Law is a particularly relevant field because of the
abundance of textual data that is produced in the field. Here, we demonstrate the
qualitative effectiveness of modeling a collection of Supreme Court of Canada decisions
with the POSLDA model.

"criminal"                                "labour"
adj           verb          noun          adj           verb          noun
criminal      appeal        accused       labour        appeal        employer
mens          respect       offence       collective    finding       employee
bodily        committing    person        employment    issue         union
actus         section       code          union         respect       board
reasonable    determining   crown         bargaining    section       code
subjective    relating      act           trade         determining   agreement
mental        doing         conviction    individual    join          employees
objective     carrying      law           employee      colleague     court
unlawful      proof         defence       agricultural  lester        relationship
common        aiding        crime         construction  dismissing    position

"insurance"                               "family"
adj           verb          noun          adj           verb          noun
insurance     appeal        policy        spousal       appeal        court
insured       respect       insurer       child         account       spouse
hypothecary   effect        insured       family        determining   parties
disability    insure        contract      support       respect       agreement
unemployment  defend        clause        economic      regard        marriage
exclusion     issue         claim         financial     consider      act
life          accord        risk          matrimonial   considering   wife
automobile    indemnify     insurance     pension       accord        pension
third         force         premium       married       subsection    relationship
standard      insuring      beneficiary   marriage      section       husband

Table 5
Example topics learned from the Supreme Court of Canada decisions 1989 to 2009 with
POSLDA.

We choose K = 40, S = 17, and S_sem = 7, as above. Four of the uncovered topics
broken into adjective, verb, and noun are displayed in Table 5. Once again, the
POS-specific topics are clear and interpretable. In the subtopic for "criminal law"
adjectives, for example, there are a number of first words from some common criminal
law phrases. These include "mens" from the common phrase mens rea (the mental
component required to commit a crime), "actus" from the common phrase actus reus
(the physical act component required to commit a crime), and "reasonable" which
often modifies the word "doubt" to form the phrase reasonable doubt. Furthermore, the
nouns in the criminal law topic help make clear that domain-specific data have been
understood properly. The most probable word in the subtopic for "criminal law" nouns
is "accused", which might naively be tagged as a verb by a dictionary-only method
that has little understanding of context. Here, it is correctly placed in the noun subtopic
because the accused is the person who stands accused of a crime.⁵

As with other datasets, there is some noisiness in the results, such as the word
"section" appearing in the verb subtopic for the criminal, labour, and family law topics.
This is likely due to the fact that in legal decisions, phrases describing the origin of laws
typically have their own sort of quasi-grammar and "section" is a common word in this
context. An example is found in Winters v. Legal Services Society,⁶ where our sentence
splitter found the sentence "The Requirements of Section 3(2)". For this to be a proper
sentence it requires a verb. The model has likely decided that the word "section" would
work best as a verb in this context; the error is therefore likely attributable to
the sentence splitting algorithm rather than to the POSLDA model. Another problem
is that the verb "appeal" has shown up as the most important verb in every topic. This
is an interesting issue because typically we have seen indistinguishing words pushed
to the syntax-only classes. One likely reason that this has not happened is that, though
the word is important in many topics, it is not important in all of the uncovered latent
topics. Appeal is also an interesting word in this domain because it is used to a great
extent as both a verb (to appeal a decision) and as a noun (the appeal in question). While
appeal is a common noun, it does not show up as any of the top ten nouns in any of the
displayed noun subtopics. Despite these slight inconsistencies, however, the
POSLDA-learned topics from the SCC dataset are interpretable and clear.

4.1.3 Noisy Data. Recently, there has been a large interest in mining the vast quantity of
text data created every day through the popular microblogging service Twitter.⁷ There
are a number of unique challenges associated with analyzing this kind of data, however,
that make it different from the datasets studied above. First, unlike the highly-structured
news articles in corpora such as TREC AP and WSJ, Twitter messages ("tweets") are
rarely well-formed sentences. Proper grammar (or even anything close to it) is all but
abandoned on Twitter and the rules of punctuation have been seemingly reinvented
(Ramage, Dumais, and Liebling 2010). One of the principal reasons for this style of
writing is that tweets are constrained to a maximum of 140 characters. This of course
poses its own issues with respect to text modelling as short documents will contain
less thematic information. In addition, partially due to character-limit constraints and
partially because Twitter is very popular amongst young people, proper spelling is rare
in such a dataset (Ritter, Cherry, and Dolan 2010). Each of these issues poses unique
problems for modelling topics in this area. Because the long-range thematic dependencies
in POSLDA are determined by word co-occurrence, multiple spellings of the same
word can hinder unsupervised topic recognition. On the other hand, because the
short-range syntactic dependencies in POSLDA are learned by understanding common
word-class transitions, structureless grammar also causes problems in determining
parts-of-speech.

Despite the issues outlined above, however, Twitter is a very interesting resource.
It represents the up-to-the-minute thoughts of millions of people across the world and
the knowledge that can be learned from this data is likely immense. One of the most
interesting analyses of Twitter data so far is to use a supervised topic model to learn
current issues about public health such as what health problems are being experienced
by whom (and where) and how people are treating those problems (Paul and Dredze
2011). Here, we simply demonstrate that without any additional machinery, POSLDA
can learn semantically and syntactically consistent topics from a collection of tweets.
We use the Twitter POS dataset released at ACL 2011 by Gimpel et al., which consists
of approximately 26,000 words across 1,827 tweets (Gimpel et al. 2011). While this is
a fairly limited collection of data, our model is nevertheless able to uncover some
interesting POS-specific topics. Table 6 shows three manually-labeled topics learned
with the settings of K = 20, S = 17, and S_sem = 7.

5 See, e.g. Black's Law Dictionary (2d Pocket ed. 2001).

6 Winters v. Legal Services Society, [1999] 3 S.C.R. 160.

7 http://www.twitter.com

"media"                 "relationships" (?)     "movies" (?)
verb        noun        verb        noun        verb        noun
phone       photo       spent       girl        watch       online
posted      video       headed      party       looking     movie
speak       smile       cheating    sex         bored       script
looking     phone       visit       bitches     seen        series
send        night       annoy       sleep       walk        avatar
follow      time        callin      eyes        thats       funny
listen      media       wanna       love        respect     cool
chat        happy       hang        rings       cleared     bad
love        chick       fck         games       announce    girl
heart       da          sucks       date        wow         dollar

Table 6
Example topics learned from Twitter with POSLDA.

The topics outlined in Table 6 are by far the noisiest and least interpretable
demonstrated thus far. Because of the non-standard text structure and issues with incorrectly
spelled words, it is difficult for the algorithm to uncover the patterns of interest.
Nevertheless, of the 20 topics, three fairly interpretable topics are demonstrated. As it is
difficult to tell exactly which part-of-speech each subtopic represents, we only list two
distinct subtopics per topic: verb and noun. The first topic - "media" - is likely the
most coherent while the other two have been labelled with trailing question marks to
convey that it is difficult to name the topics. It is for this reason that Twitter-specific topic
models (or at least Twitter-specific alterations to existing topic models) may be required
(Ramage, Dumais, and Liebling 2010; Ritter, Cherry, and Dolan 2010). Analyzing the
listings in Table 6 does show, however, that POSLDA can uncover interesting semantic
and syntactic patterns even in this highly noisy source.

4.2 Quantitative Results 

There are several methods that are commonly employed to evaluate novel probabilistic
models in the literature (Wallach et al. 2009). The original LDA paper - and many others -
use the perplexity, which is a standard metric in the information retrieval literature (Blei,
Ng, and Jordan 2003; Teh et al. 2006). A probabilistic model can also be evaluated by
considering its performance on an extrinsic task. Here, we first focus on the perplexity
and use it to measure POSLDA's performance as a predictive language model, and then
discuss using the model for unsupervised POS tagging.

4.2.1 Predictive Language Modelling. Following the standard practice in topic modeling
research (Blei, Ng, and Jordan 2003; Griffiths et al. 2005; Teh et al. 2006; Zhu, Blei, and
Lafferty 2006), we fit a model to a training set and compute the perplexity of a held-out
test set. The perplexity can be defined as the predicted average number of words that
are equally likely to be generated for a given position (Zhu, Blei, and Lafferty 2006). It
is also a monotonically decreasing function of the log likelihood.



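The perplexity of a held-out test set $\mathcal{D}_{test} = \{\mathbf{w}_d\}_{d=1}^{D}$ is given as:

$$ppx(\mathcal{D}_{test}) = \exp\left(-\frac{\sum_{d} \log p(\mathbf{w}_d \mid \mathcal{M})}{\sum_{d} N_d}\right) \qquad (4)$$

where $\mathcal{M}$ represents the model parameters learned from the training data, $p(\mathbf{w}_d \mid \mathcal{M})$ is
the probability (likelihood) of document $\mathbf{w}_d$ given the learned model parameters, and
$N_d$ is the number of words in document $\mathbf{w}_d$.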
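The computation can be sketched in a few lines, assuming the standard definition of perplexity as the exponential of the negative total held-out log likelihood divided by the total number of held-out words. The function name and the toy inputs are illustrative assumptions; obtaining the per-document log likelihoods from a fitted model is the hard part and is not shown.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum_d log p(w_d | M)) / (sum_d N_d)) for a held-out set."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_words)


# Sanity check on made-up data: a model that gives every word
# probability 1/V should have perplexity (approximately) V.
V = 1000
lengths = [50, 120, 80]
log_likelihoods = [n * math.log(1.0 / V) for n in lengths]
print(perplexity(log_likelihoods, lengths))
```

The sanity check matches the interpretation given above: perplexity is the effective number of equally likely words per position.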

We test the POSLDA model on a subset of the TREC AP dataset. We use ten-fold
cross-validation where the data is split into 10 subsets of equal size. We conduct 10
experiments where one of the subsets is held out for testing and the model is trained on
the remaining 9 subsets. We report the average over the 10 experiments. We compare
the perplexity of POSLDA to the original LDA model, a Bayesian HMM, and Griffiths
et al.'s HMMLDA. Each model is trained using 10,000 iterations of Gibbs sampling.
We use asymmetric priors on the document-topic proportions θ and the rows of the
HMM transition matrix π. Following Wallach, Mimno, and McCallum (2009), we use
a symmetric prior on the topic-word distributions because having asymmetric values
on the topics themselves does not seem to lead to improved performance. The priors'
hyperparameters are optimized using Minka's fixed-point method (Wallach 2008). For
these experiments we use S = 10 classes of which 5 are designated as semantic. By
definition, the HMMLDA model has S_sem = 1 and the LDA model has S = S_sem = 1. The
HMM model does not consider the number of topics K.
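The ten-fold protocol described above can be sketched as follows. This is only an outline under stated assumptions: `fit_and_score` is a hypothetical callback standing in for training a model with Gibbs sampling on the training folds and returning the held-out perplexity, and the shuffling seed is arbitrary.

```python
import random


def k_fold_splits(n_docs, k=10, seed=0):
    """Yield (train_ids, test_ids) pairs for k-fold cross-validation."""
    ids = list(range(n_docs))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k folds of (near-)equal size
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


def cross_validated_perplexity(n_docs, fit_and_score, k=10):
    """Average the held-out perplexity over the k experiments.

    fit_and_score(train_ids, test_ids) is a hypothetical callback that
    trains on the training documents and returns the test perplexity.
    """
    scores = [fit_and_score(tr, te) for tr, te in k_fold_splits(n_docs, k)]
    return sum(scores) / len(scores)


# Each document is held out exactly once across the 10 experiments.
for train_ids, test_ids in k_fold_splits(100, k=10):
    assert len(train_ids) == 90 and len(test_ids) == 10
```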



Figure 3
Perplexity of POSLDA and other similar probabilistic models as K varies. (Legend: HMM,
HMMLDA, LDA, POSLDA.)

Figure 3 shows the average perplexity values on a held-out test set for a number of
models in the same family as POSLDA over a range of topic values. The HMM achieves
lower perplexity values than LDA at all topic settings K ∈ {5, 10, 15, 20, 25, 30}. All
three topic-based models realize perplexity improvements as the number of topics K
increases. Both HMMLDA and POSLDA - which combine the benefits of the HMM
and LDA models - result in lower perplexity values than either model alone. POSLDA's
additional flexibility in terms of the additional semantic classes allows it to record the
lowest perplexity values of all the models tested for each topic setting.




Figure 4
Perplexity of POSLDA and other similar probabilistic models as S varies. (Horizontal axis:
number of states, 5 to 30.)

While Figure 3 shows how the POSLDA model's ability to generalize on unseen
data is affected by the number of topics, Figure 4 illustrates the changes in the perplexity
when the number of classes S is the independent variable. While it may not reflect the
best possible values for the POSLDA model, we set the number of semantic classes
S_sem = S. Interestingly, the HMM's perplexity starts to shoot up when S = 15, and both
HMMLDA and POSLDA show poor perplexity when S = 20. This likely reflects too
much variation in the model (especially with no disambiguation between semantic and
syntactic classes). Next, we look at how the perplexity varies when S is held fixed but
S_sem is free to vary.

Figure 5 shows the average perplexity as the number of semantic states S_sem is
varied. We set S = 10 and investigate how different settings of S_sem affect the model's
ability to generalize. When S_sem = 0 the model reduces to a Bayesian HMM and when
S_sem = 1 it becomes the HMMLDA model. Adding some semantic information with
HMMLDA improves the perplexity considerably (1172 to 957) and further distinguishing
semantic information in POSLDA continues to improve the perplexity until it
reaches a minimum at S_sem = 6. This reflects the fact that certain classes of words such
as conjunctions are not aided by thematic information.
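For scale, the improvement from the Bayesian HMM to HMMLDA quoted above (1172 to 957) corresponds to a relative perplexity reduction of roughly 18%:

$$\frac{1172 - 957}{1172} = \frac{215}{1172} \approx 0.183$$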

Figure 5
Perplexity of POSLDA as S_sem varies. (Horizontal axis: number of semantic states.)

Above we have demonstrated both that POSLDA tends to learn interpretable topics
that are part-of-speech specific, and that it can lead to better predictive performance
than other similar generative probabilistic models. The purposes of the models are
clearly different, however. POSLDA is more flexible, but it requires a more expensive
representation of text (a sequence of words rather than a bag). HMMs are also extremely
useful and versatile for a number of applications not necessarily related to text. Next
we are interested in applications where the POSLDA model is truly needed or can truly
make a difference. In (Darling and Song 2011) we showed how a POSLDA-like model
can improve the results in a text summarization task. In the next section, we concentrate
on unsupervised part-of-speech tagging.

4.3 Unsupervised POS Tagging 

Goldwater and Griffiths show that Bayesian HMMs increase the accuracy of unsupervised
POS tagging by up to 14 percentage points over the MLE approach (Goldwater
and Griffiths 2007). While these results are impressive, unsupervised approaches continue
to fall well short of the accuracy obtained with supervised taggers. Nevertheless,
unsupervised approaches are preferred in many situations, especially when there is no
access to large quantities of training data in a specific domain or particular language.
We therefore aim to continue improving accuracy with unsupervised approaches by
introducing semantics as an additional source of information for this task.

The word "seal" appears both as a verb (to seal a jar or a leak) and as a noun (the
marine mammal Pinniped) in the WSJ treebank dataset. The HMM approach can often
tag each of these occurrences appropriately given the context, but there are cases where
it will fail. However, if the topic being discussed is marine biology, we have another
piece of evidence that increases the likelihood that this occurrence of the word "seal"
is a noun about the marine mammal. If the topic is about pickling or roofing, however,
"seal" as a verb is given more evidence. Another example is the word "book". In a
literary context it will almost always take the form of a noun. However, in a topic about
promotions or services, the word is more likely to function as a verb: "to book a hotel".
Following this intuition, we use the POSLDA model with the number of HMM states
set to the number of possible tags in the tag set and use the state index learned through
posterior inference as the predicted tag for each word.

We take two approaches to perform unsupervised POS tagging using the POSLDA
model. The first approach uses a tag dictionary containing varying amounts of information
on the possible tags that certain words have taken on in the training data.
This reduces the problem to a case of POS disambiguation rather than pure unsupervised
tagging. It is, however, the most common approach to demonstrating results in
unsupervised tagging (Goldwater and Griffiths 2007). The second approach is a pure
unsupervised method that implements POS clustering. Because we cannot know which
classes represent which parts-of-speech, we instead compare the learned clusters to the
correct clustering where clusters are exchangeable. Accordingly, we use the variation of
information (VI) for evaluating this task (Meila 2007).

We showed in (Darling, Paul, and Song 2012) that this model consistently beats the
Bayesian HMM approach in the Twitter domain. Here, we show that these improvements
hold in a more traditional domain: the ACL_DCI release of the Penn Treebank's
collection of Wall Street Journal newswire articles. This dataset contains approximately
3M words over 6K documents. We condense the tag set to the 17-tag set introduced by
Smith and Eisner (2005), which is more standard for unsupervised POS tagging.

We follow the established form of Merialdo (1993) and Goldwater and Griffiths
(2007) for unsupervised POS tagging by making use of a tag dictionary to constrain the
possible tag choices for each word and therefore render the problem closer to
disambiguation. As in Goldwater and Griffiths (2007), we employ a number of dictionaries
with varying degrees of knowledge. A dictionary contains the tag information for a
word only when it appears more than d times in the training corpus. We ran experiments
for d = 1, 2, 3, 5, 10, and ∞ (where the problem becomes POS clustering). We report both
tagging accuracy and the variation of information (VI), which computes the information
lost in moving from one clustering C to another C': $VI(C, C') = H(C) + H(C') - 2I(C, C')$
(Meila 2007). This can be interpreted as a distance between the
clusterings, where a smaller value indicates higher similarity.
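A minimal sketch of the VI computation from two label sequences follows, using empirical entropies and mutual information. The function name and the toy labels are illustrative, and the use of natural logarithms is an assumption, since the base is not stated here.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """VI(C, C') = H(C) + H(C') - 2 I(C, C'), computed in nats."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual_info = sum(
        (c / n) * math.log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )
    return entropy(count_a) + entropy(count_b) - 2.0 * mutual_info


# Cluster names are exchangeable: the same partition under different
# names gives a VI of (numerically) zero.
gold = ["N", "V", "N", "ADJ", "V", "N"]
pred = [3, 7, 3, 1, 7, 3]
print(variation_of_information(gold, pred))
```

Because only the partition matters, the measure suits the clustering setting where learned classes cannot be mapped to named tags in advance.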

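To properly make use of a tag dictionary, we slightly alter the POSLDA model
by adding an additional observed random variable to the model as in Labeled LDA
(Ramage et al. 2009). The tag dictionary is encoded in the model by a list of binary class
indicators for each word, $\Lambda^{(w)} = (\Lambda_{w,1}, \ldots, \Lambda_{w,S})$, where $\Lambda_{w,c} \in \{0, 1\}$ for each class c. The
model then becomes as depicted in Figure 1 (right). The sampling equation is then
updated simply to $p(z_i, c_i \mid \mathbf{c}_{-i}, \mathbf{z}_{-i}, \mathbf{w}) \propto p(z_i, c_i \mid \mathbf{c}_{-i}, \mathbf{z}_{-i}, \mathbf{w}) \times \Lambda_{w_i, c_i}$, where the probability on the right-hand side is the original, unconstrained sampling distribution.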
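The effect of the tag dictionary on sampling can be sketched as below. This is a simplified illustration, not the authors' implementation: it masks a distribution over classes only (the full sampler is over topic and class jointly), the function names are hypothetical, the probabilities are made up, and it assumes that words absent from the dictionary are left unconstrained (all indicators equal to 1).

```python
from collections import Counter, defaultdict


def build_tag_dictionary(tagged_tokens, d):
    """Map each word seen more than d times to the set of tags it took."""
    freq = Counter(word for word, _ in tagged_tokens)
    tags = defaultdict(set)
    for word, tag in tagged_tokens:
        if freq[word] > d:
            tags[word].add(tag)
    return dict(tags)


def constrain(probs, word, tag_dict):
    """Multiply by the binary indicators for `word`, then renormalize.

    probs maps each class to its unconstrained sampling probability.
    Words missing from the dictionary keep every class (an assumption).
    """
    allowed = tag_dict.get(word)
    masked = {
        c: p if (allowed is None or c in allowed) else 0.0
        for c, p in probs.items()
    }
    total = sum(masked.values())
    return {c: p / total for c, p in masked.items()}


tokens = [("seal", "N"), ("seal", "V"), ("seal", "N"), ("the", "DET")]
tag_dict = build_tag_dictionary(tokens, d=1)  # only "seal" occurs > 1 time
print(constrain({"N": 0.2, "V": 0.3, "DET": 0.5}, "seal", tag_dict))
```

With d = ∞ the dictionary is empty, no class is ever masked, and the procedure reduces to unconstrained clustering.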

We run our Gibbs sampler for 20,000 iterations and obtain a maximum a posteriori
(MAP) estimate for each word's tag by employing simulated annealing. Each posterior
probability p(c, z | ·) in the sampling distribution is raised to the power of 1/τ, where τ
is a value (in traditional annealing used in physics τ would be a temperature) that
approaches 0 as the sampler converges. This approach is akin to bringing a system
from an arbitrary state to one with the lowest energy, thus viewing the Gibbs sampling
procedure as a random search whose goal is to identify the MAP tag sequence, as employed
in (Goldwater and Griffiths 2007). Finally, we run each experiment 5 times from
random initializations and report the average accuracy and variation of information. All
improved results reported for the POSLDA model are statistically significantly different
from those achieved with the BHMM model as determined by a Student's t-test with
p ≪ 0.01.
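The annealing step can be illustrated with a small sketch. The temperature values below are hypothetical (no schedule is given here), and the probabilities are made up; the point is only that raising probabilities to the power 1/τ and renormalizing concentrates mass on the most probable outcome as τ shrinks.

```python
def anneal(probs, tau):
    """Raise each probability to the power 1/tau and renormalize."""
    powered = [p ** (1.0 / tau) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


probs = [0.5, 0.3, 0.2]
for tau in (1.0, 0.5, 0.1):  # hypothetical decreasing temperatures
    print(tau, [round(p, 4) for p in anneal(probs, tau)])
```

At τ = 1 the distribution is unchanged, and as τ decreases the sampler behaves more and more like a greedy search for the MAP assignment. A practical implementation would likely work in log space to avoid underflow at small τ.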

The achieved results are shown in Table 7 for a random tag assignment, the Bayesian
HMM (BHMM) described in (Goldwater and Griffiths 2007), and our POSLDA tagging
approach. For both BHMM and POSLDA, we optimize the asymmetric hyperparameters
α and γ and the symmetric prior β using either direct optimization or sampling with
similar results. We use 6 semantic classes - adjective, verb, gerund or present participle
verb, adverb, noun, and past participle verb - leaving 11 remaining for syntax words,
and we report results for K = 10 topics.

d               1       2       3       5       10      ∞

Accuracy
random          58.7    57.9    57.5    56.7
BHMM            86.4    85.6    85.3    84.9    84.5
POSLDA          87.7    87.2    86.9    86.5    85.9

VI
random          2.37    2.44    2.48    2.54    2.63    5.07
BHMM            0.77    0.83    0.86    0.90    0.90    2.25
POSLDA          0.73    0.77    0.79    0.82    0.86    2.30

Corpus stats
% ambig.        66.0    66.8    67.4    68.1    69.3    100
tags / token    2.34    2.48    2.57    2.71    2.95

Table 7
POS Tagging Results on WSJ dataset.




Figure 6
POS Tagging Accuracy on WSJ dataset. (Horizontal axis: d.)

These POS tagging results are also shown graphically in Figure 6. Our model
outperforms the Bayesian HMM for all values of d in accuracy and all values of d but
one for variation of information. In fact, POSLDA maintains higher tagging accuracy
for d = 5 than the Bayesian HMM for d = 1. Our approach consistently surpasses the
results achieved with BHMM by approximately 1.5 percentage points for each value of
d. The one case where POSLDA did not improve upon BHMM is where d = ∞ for POS
clustering.
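The size of the gain can be checked directly from the accuracy rows of Table 7. The short script below only restates those reported numbers and takes their differences; the variable names are illustrative.

```python
# Accuracy values as reported in Table 7 for d = 1, 2, 3, 5, 10.
d_values = [1, 2, 3, 5, 10]
bhmm = [86.4, 85.6, 85.3, 84.9, 84.5]
poslda = [87.7, 87.2, 86.9, 86.5, 85.9]

# Per-setting gain of POSLDA over BHMM, in percentage points.
gains = [round(p - b, 1) for p, b in zip(poslda, bhmm)]
for d, g in zip(d_values, gains):
    print(f"d = {d:>2}: +{g} points")
print("mean gain:", round(sum(gains) / len(gains), 2))
```

The per-setting gains range from 1.3 to 1.6 points, which is where the "approximately 1.5 percentage points" summary comes from.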

5. Conclusions and Future Work 

In this article we presented the combined topic and syntax model, Part-of-Speech LDA
or POSLDA. We have also demonstrated its use as an improved model for performing
unsupervised POS tagging. Our overarching goal is to demonstrate that combining the
two axes of word meaning - syntax and semantics - into a coherent model can result
in improvements to both learned topic distributions and to NLP tasks such as POS
tagging. We showed that incorporating semantic information into the HMM model led
to improved results for this task. Additionally, we showed that combining the two axes
of word information results in a language model that achieves lower perplexity - and
therefore better predictive capability - than other similar probabilistic models.

In future work we would like to apply the POSLDA model to other NLP tasks 
that also rely upon learned word distributions. These include text summarization, text 
segmentation, and translation. An interesting avenue for further research is POSLDA's 
ability to serve as the base of a language generation system. While performing the 
LDA generative process would perhaps result in documents that contain words that 
are semantically related, we cannot say that it is generating language. POSLDA on the 
other hand - with some strong prior knowledge - may be able to generate coherent 
natural language because it exhibits more structure in the language generation process. 
We hope to explore this direction of research and apply it to tasks such as database 
population and abstract text summarization.